AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models
Large language models (LLMs) like ChatGPT have demonstrated impressive
intelligence. How to evaluate the question-solving abilities of LLMs and their
degrees of intelligence is a prominent yet challenging issue. First,
question-solving ability is interlaced with different ability branches, such as
understanding, and massive knowledge categories, such as mathematics. Second,
question inputs are multimodal and may involve text and images. Third, the
response formats of LLMs are diverse, which poses great challenges for result
extraction and evaluation. In this paper, we propose AGIBench -- a multi-granularity,
multimodal, human-referenced, and auto-scoring benchmarking methodology for
LLMs. Instead of a collection of blended questions, AGIBench focuses on three
typical ability branches and adopts a four-tuple <ability branch, knowledge,
difficulty, modal> to label the attributes of each question. First, it supports
multi-granularity benchmarking, e.g., per-question, per-ability branch,
per-knowledge, per-modal, per-dataset, and per-difficulty level granularities.
Second, it contains multimodal input, including text and images. Third, it
classifies all questions into five difficulty levels according to the average
accuracy rate of a large pool of educated humans (human-referenced). Fourth,
it adopts zero-shot learning to avoid introducing additional unpredictability
and provides an auto-scoring method to extract and judge the result. Finally,
it defines multi-dimensional metrics, including accuracy under the average,
worst, best, and majority voting cases, and repeatability. AGIBench is
publicly available at \url{https://www.benchcouncil.org/agibench}.
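To make the four-tuple labeling and the multi-granularity, majority-voting metrics concrete, here is a minimal Python sketch. The `Question` record, its field names, and the helper functions are illustrative assumptions, not AGIBench's actual schema or scoring code.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass
class Question:
    """One benchmark item labeled with the four-tuple
    <ability branch, knowledge, difficulty, modal> (hypothetical schema)."""
    qid: str
    ability_branch: str   # e.g. "understanding"
    knowledge: str        # e.g. "mathematics"
    difficulty: int       # 1..5, derived from average human accuracy
    modal: str            # "text" or "text+image"
    answer: str           # gold answer label, e.g. "B"

def majority_vote_accuracy(questions, runs):
    """Accuracy when each question's prediction is the majority label
    across repeated runs of the same model (runs: qid -> list of labels)."""
    correct = 0
    for q in questions:
        predicted, _ = Counter(runs[q.qid]).most_common(1)[0]
        correct += predicted == q.answer
    return correct / len(questions)

def accuracy_by(questions, runs, key):
    """Multi-granularity view: accuracy grouped by any attribute of the
    four-tuple, e.g. key=lambda q: q.difficulty or key=lambda q: q.modal."""
    groups = defaultdict(list)
    for q in questions:
        groups[key(q)].append(q)
    return {g: majority_vote_accuracy(qs, runs) for g, qs in groups.items()}

# Example: two questions, three repeated zero-shot runs each.
qs = [
    Question("q1", "understanding", "mathematics", 2, "text", "B"),
    Question("q2", "reasoning", "physics", 4, "text+image", "C"),
]
runs = {"q1": ["B", "B", "A"], "q2": ["C", "D", "D"]}
print(accuracy_by(qs, runs, key=lambda q: q.difficulty))  # {2: 1.0, 4: 0.0}
```

Grouping by other attributes of the tuple yields the per-ability-branch, per-knowledge, or per-modal views in the same way; the average, worst, and best cases follow by aggregating per-run accuracies instead of votes.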
The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems
We now live in an era of big data, and big data applications are becoming
increasingly pervasive. How to benchmark data center computer systems running
big data applications (big data systems for short) is a hot topic. In this
paper, we focus on measuring the performance impact of diverse applications
and scalable volumes of data sets on big data systems. For four typical data
analysis applications, an important class of big data applications, our
experiments yield two major findings. First, the data scale has a significant
impact on the performance of big data systems, so we must provide scalable
volumes of data sets in big data benchmarks. Second, for the four applications,
even though all of them use simple algorithms, their performance trends differ
as the data scale increases; hence, we must consider not only the variety of
data sets but also the variety of applications when benchmarking big data
systems.
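The setup described above suggests a simple harness: run each application over a series of data scales and compare the resulting performance curves rather than a single fixed-size measurement. The sketch below is an illustrative reconstruction with toy stand-in workloads, not the paper's actual benchmark suite.

```python
import time

def benchmark_scaling(workload, scales, make_dataset):
    """Run one workload across increasing data scales and record
    wall-clock time, so performance trends can be compared across
    applications rather than at a single data size."""
    results = {}
    for n in scales:
        data = make_dataset(n)           # regenerate input at each scale
        start = time.perf_counter()
        workload(data)
        results[n] = time.perf_counter() - start
    return results

# Toy stand-ins for real data analysis applications.
def sort_workload(data):
    sorted(data)                         # O(n log n) trend

def grep_workload(data):
    sum(1 for x in data if x % 7 == 0)   # O(n) trend

scales = [10_000, 100_000, 1_000_000]
make_dataset = lambda n: list(range(n, 0, -1))
for name, wl in [("sort", sort_workload), ("grep", grep_workload)]:
    print(name, benchmark_scaling(wl, scales, make_dataset))
```

Even in this toy form, the two workloads' timing curves diverge as the scale grows, which is the behavior the paper argues big data benchmarks must capture.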
Quality at the Tail
Practical applications employing deep learning must guarantee inference
quality. However, we found that the inference quality of state-of-the-art and
state-of-the-practice models in practical applications follows a long-tail
distribution.
In the real world, many tasks have strict requirements for the quality of deep
learning inference, such as safety-critical and mission-critical tasks. The
fluctuation of inference quality seriously affects practical deployment, and
the quality at the tail may lead to severe consequences. State-of-the-art and
state-of-the-practice models with outstanding inference quality, designed and
trained under loose constraints, may still deliver poor inference quality under
constraints of practical significance. On the one hand, the
neural network models must be deployed on complex systems with limited
resources. On the other hand, safety-critical and mission-critical tasks need
to meet more metric constraints while ensuring high inference quality.
We coin a new term, ``tail quality,'' to characterize this essential
requirement and challenge. We also propose a new metric,
``X-Critical-Quality,'' to measure the inference quality under certain
constraints. This article reveals factors contributing to the failure of using
state-of-the-art and state-of-the-practice algorithms and systems in real
scenarios. Therefore, we call for establishing innovative methodologies and
tools to tackle this enormous challenge.
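The abstract does not spell out how ``X-Critical-Quality'' is computed, so the following Python sketch is only one plausible reading: score each inference against a constraint (here, a hypothetical latency budget) and report a low percentile of per-run quality instead of the mean, which is what a tail-quality view would expose.

```python
import statistics

def quality_under_constraint(records, latency_budget):
    """One plausible constrained-quality metric: the fraction of
    inferences that are both correct and within the latency budget.
    records: list of (latency_seconds, correct) pairs."""
    ok = [lat <= latency_budget and correct for lat, correct in records]
    return sum(ok) / len(records)

def tail_quality(run_qualities, percentile=5):
    """Quality at the tail: a low percentile of per-run quality across
    repeated evaluations, exposing fluctuation the average hides."""
    qs = sorted(run_qualities)
    k = max(0, int(len(qs) * percentile / 100) - 1)
    return qs[k]

records = [(0.02, True), (0.03, True), (0.15, True), (0.02, False)]
print(quality_under_constraint(records, 0.05))  # 0.5: correct *and* fast

# Ten repeated evaluation runs of the same model and system.
runs = [0.95, 0.94, 0.96, 0.95, 0.93, 0.95, 0.94, 0.62, 0.95, 0.96]
print(statistics.mean(runs))   # ~0.915: the average looks acceptable
print(tail_quality(runs, 10))  # 0.62: the tail reveals the problem
```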